Improving a Page Classifier with Anchor Extraction and Link Analysis

نویسنده

  • William W. Cohen
چکیده

Most text categorization systems use simple models of documents and document collections. In this paper we describe a technique that improves a simple web page classifier’s performance on pages from a new, unseen web site, by exploiting link structure within a site as well as page structure within hub pages. On real-world test cases, this technique significantly and substantially improves the accuracy of a bag-of-words classifier, reducing error rate by about half, on average. The system uses a variant of co-training to exploit unlabeled data from a new site. Pages are labeled using the base classifier; the results are used by a restricted wrapper-learner to propose potential “main-category anchor wrappers”; and finally, these wrappers are used as features by a third learner to find a categorization of the site that implies a simple hub structure, but which also largely agrees with the original bag-of-words classifier.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Pseudo-Anchor Text Extraction for Vertical Search

Anchor text plays a special important role in improving the performance of general Web search. The importance of anchor text comes from the fact that it is fairly objective description for a Web page by potentially a large amount of other Web pages. Vertical search provides indexing and search functionality for objects in a certain domain, and is becoming an important supplement for general Web...

متن کامل

NTCIR-5 WEB Navi-2 Experiments at Osaka Kyoiku University - Page, Anchor and Title Indexing, and In-link Count, Inter Page and Inter Site Link Analyses

This paper describes experimental results of WEB Navigational Retrieval Subtask 2 (WEB Navi-2). We made three gram-based indices, namely indices for text in whole page, text in title tag and text in anchor tag. Since gram-based indices are able to index all strings in target text, words that are not found in dictionaries are also indexed essentially. We used words in TITLE tag of search topics ...

متن کامل

Report on the TREC 2003 Experiment: Genomic and Web Searches

This year we took part in the genomic information retrieval and information extraction tasks, as well as the named page and topic distillation searches. In carrying out the last two tasks, we made use of link anchor information and document content in order to construct Web page representatives. This type of document representation uses multi-vectors to highlight the importance of both hyperlin...

متن کامل

Semantic Web Prefetching Scheme using Naïve Bayes Classifier

Web prefetching is an effective technique to minimize user’s web access latency. Web page content provides useful information for making the predictions that is used to perform prefetching of web objects. In this paper we propose semantic prefetching scheme that uses anchor texts present in the web page to make effective predictions. The scheme applies Naïve Bayes Classifier for computing the p...

متن کامل

Improving the performance of focused web crawlers

This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002